A SYSTEMS ENGINEERING APPROACH TO RELIABILITY


Alexander W. Boldyreff

Consultant to The RAND Corporation, Santa Monica, California
Professor  of  Engineering  and  Production  Management,  UCLA 


Systems engineering may be defined as an activity aiming at optimum
operation of existing complex integrated systems or optimum design and
development of future systems.

This activity must necessarily start with a good understanding of the
customer's mission, of the state of the art as a function of time, and
of the budgetary and time constraints.

System optimization can then be reduced essentially to seeking the best
balance between performance, reliability, and cost.

Here  the  cost  must  be  measured  both  in  time  and  in  dollars,  and  the 
latter  must  include  not  only  the  development  and  manufacturing  costs,  but 
also  the  cost  of  handling  and  transportation,  storage  and  surveillance, 
support  equipment,  maintenance,  replacement,  logistics,  and  personnel  and 
their  training. 

The old definition of reliability as the probability of failure-free
operation, for a specified length of time, and in a specified environment,
is both incomplete and inadequate.


Any views expressed in this paper are those of the author. They should
not be interpreted as reflecting the views of The RAND Corporation or the
official opinion or policy of any of its governmental or private research
sponsors. Papers are reproduced by The RAND Corporation as a courtesy to
members of its staff.

This  paper  was  prepared  for  presentation  at  the  Uth  National  Conference 
of  the  Aircraft  and  Missile  Division  of  American  Society  for  Quality  Control, 
Los Angeles, California, November 9, 1961.
Likewise the computation of reliability of a complex system from the
reliabilities of its components is often of questionable utility.

Reliability must be approached as primarily a problem of design, not a
mere exercise in elementary statistics. The primary goal should be achievement
of reliability; reliability measurement and prediction, while important,
are merely secondary.

As  an  example,  let  us  assume  a  series  system  of  n  components,  such  that 
the  failure  of  any  one  component  results  in  system  failure.  It  is  usual  to 
estimate  the  reliability  of  such  a  system  in  terms  of  the  (geometric)  mean 
component  reliability. 

Consider then a system of no more than 500 components. For system
reliabilities of 0.70 and 0.95, the mean component reliabilities are 0.99929
and 0.99995, respectively. Thus, it may be argued that the reliability of the
system can be increased from 0.70 to 0.95 by an improvement in mean component
reliability of only 0.07 per cent. But of course this reasoning is misleading.
Component improvement means decreasing the probability of failure. In this
example, to improve system reliability from 0.70 to 0.95 requires decreasing
the probability of component failure from 0.00071 to 0.00005, and this means
elimination of more than 90 per cent of failures of components which are
already highly reliable. This can only be done through extensive and expensive
testing to determine the assignable causes of failure, the actual physical
mechanisms of failure, and subsequent redesign, followed by more testing. To
carry out such a program for each of the very large number of different
components of many complex systems now in the process of development is
patently impossible, even at a prohibitive cost in time and money.
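
The arithmetic of this example can be checked with a short sketch. It assumes
the product rule for a series system of n independent components, so that the
mean component reliability is the n-th root of the system reliability; the
function and variable names are illustrative and not part of the paper.
Computed this way, the mean component reliability for a system reliability of
0.95 comes out nearer 0.99990 than the quoted 0.99995, and the required cut in
component failures nearer 86 per cent. The argument is unaffected: a gain that
looks negligible when stated as reliability demands the elimination of most of
the remaining failures.

```python
def mean_component_reliability(system_reliability: float, n: int) -> float:
    """Geometric-mean component reliability r such that r**n equals the system value."""
    return system_reliability ** (1.0 / n)


n = 500  # number of components in series, as in the example

r_low = mean_component_reliability(0.70, n)
r_high = mean_component_reliability(0.95, n)

# The same improvement, stated two ways.
reliability_gain = (r_high - r_low) / r_low      # relative rise in reliability
failure_cut = 1 - (1 - r_high) / (1 - r_low)     # fraction of failures removed

for target, r in ((0.70, r_low), (0.95, r_high)):
    print(f"R = {target:.2f}: r = {r:.5f}, failure probability = {1 - r:.5f}")
print(f"relative gain in component reliability: {reliability_gain:.2%}")
print(f"fraction of component failures to eliminate: {failure_cut:.0%}")
```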

While I do not underestimate the importance of component reliability
improvement programs, I do feel that such programs alone are not enough.




Furthermore, the whole concept of component reliability is hard to define.
Unlike such physical constants as mass, volume, density, etc., component
reliability cannot as a rule be described by a single number. Thus, the same
vacuum  tube  may  have  a  mean  life  to  failure  of  10,000  hrs  in  ground  equip¬ 
ment,  P500  hrs  in  aircraft,  and  only  13  minutes  in  a  missile. 

It is for the above reasons, and because of a general lack of faith in
numerology, that I have been concerned during the past twelve years with a
systematic study of the problem of reliable system design using existing and
therefore none too reliable components.

This is then the main theme of my present paper. The central point, of
course, is that reliability must be sought as an integral, and perhaps the
most important, part of the over-all system design.


I shall begin with a listing of what I believe should be the principal
areas of concern to a reliability engineering organization:

1. Conceptual design. This is where reliability improvements can be
gained in big chunks, instead of infinitesimals, through relaxation in
unnecessarily stringent performance requirements, with the resulting reduction
in system complexity.

2.  Malfunction  reporting  in  test  and  field. 

3.  Environmental  tests  in  laboratory  and  field. 

4.  Reliability  analysis  and  prediction. 

5.  Determination  of  assignable  causes  of  failure,  and  of  the  actual 
physical  mechanism  of  failure. 

6.  Recommendations  of  redesign. 

7.  Shop  follow-up  and  project  coordination. 

8.  Recommendations  of  optimum  maintenance  and  logistics. 


9. Education:

a)  of  reliability  engineers 

b)  of  management 

c)  of  the  customer 

d)  of  the  designers. 

10.  Standards,  vendor  evaluation,  and  receiving  inspection. 

11.  Manufacturing  inspection  and  quality  control. 

12.  Handling  and  transportation. 

13. Storage and surveillance.

14. Procedures for sound operational use.


I shall next give a short list of some general methods of increasing
system reliability which I believe to be basic in designing for reliability.
These are as follows:

1. Critical examination of systems objectives in the light of customer's
mission, the state of the arts, and the costs involved.

a)  Avoidance  of  excessive  performance  requirements  and  consequent 
reduction  in  complexity. 

b) Avoidance of multifunction systems, whenever these functions can
be separated.

2.  Designing  for  realistic  environment. 

3.  Designing  for  producibility  and  maintainability. 

4. Designing with a clear understanding of the conditions of
operational use.

5. Using tested (proven) components whenever possible.

6. Using standard mechanisms and circuitry whenever possible.

7.  Maximum  standardization. 




8. Optimum use of modular design.

9.  Development  of  reliable  failure  detecting  equipment. 

10.  Optimum  use  of  redundancy. 

11. Widest possible use of fail-safe design.

12.  Provisions  for  adequate  customer  training. 

13.  Provision  of  adequate  support  equipment. 

14. Intelligent use of approximate solutions as a means of simplifying
mechanization.

15. In manned systems, provisions for maintenance without interruption
of operation.

Time  will  not  permit  a  detailed  discussion  of  each  of  these  subjects. 

Instead I would like to concentrate on just one of these: that of the
uses of redundancy.

It has been generally recognized that in the case of aircraft, after
a half-century of experience, acceptable reliability was attained only
through redundant design, so that if one component failed another could
assume its function. It is because of this that while some kind of failure
(calling for emergency service outside of normal maintenance routine) may
occur in aircraft every seven and a half hours of flying, the ratio of such
failures to disasters is ten thousand to one. Not so for those systems that
are strictly serial in nature. In such systems every component must function
properly for successful system operation, so that the failure of a single
component fails the system. Guided missiles and many other weapon systems
are examples of systems that are almost entirely serial in nature in this
sense. Here the ratio of failures to disasters is one to one.

It is customary to express the reliability of a series system by the
product of the reliabilities of the components. The realism of this
assumption is open to serious doubt, except for those cases where the
individual components have reliabilities of the same magnitude, and component
failures may be treated as independent events.

Nevertheless, with this assumption, we can readily see the dramatic way
in which complexity decreases reliability. Thus 100 components with mean
reliability of 0.90, when operating in series, have a system reliability of
only 0.000026, or practically zero.

However, if it were possible to replace each component of this
hypothetical system by three components in parallel, the new system would have
a reliability of 0.90.
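
These two figures follow directly from the product rule, as the following
sketch shows. It assumes independent component failures and a parallel group
that works if any one of its members works; the helper names `series` and
`parallel` are illustrative only.

```python
def series(reliabilities):
    """Reliability of a series system: every element must work."""
    result = 1.0
    for r in reliabilities:
        result *= r
    return result


def parallel(reliabilities):
    """Reliability of a parallel group: at least one element must work."""
    all_fail = 1.0
    for r in reliabilities:
        all_fail *= 1.0 - r
    return 1.0 - all_fail


n, r = 100, 0.90

plain = series([r] * n)                      # 100 components in series
tripled = series([parallel([r] * 3)] * n)    # each replaced by 3 in parallel

print(f"100 components in series:         {plain:.6f}")
print(f"each replaced by 3 in parallel:   {tripled:.2f}")
```

Each triplet has reliability 1 - 0.1^3 = 0.999, and one hundred such triplets
in series give roughly 0.90.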

Why is it then that paralleling or redundancy is not used more widely?

There are several good reasons:

1. In many cases redundancy is either impossible or impractical.

2. In all cases the use of redundancy implies penalties: added
volume, weight, power, environment control, increased frequency of component
failure and corresponding increase in maintenance, spares, etc. Likewise,
uncritical use of redundancy may seriously affect performance. For example,
the increase in weight will decrease the range.

Nevertheless,  redundancy  is  used  extensively  in  the  design  of  manned 
aircraft  and  is  standard  practice  in  the  design  of  more  critical  subsystems 
of  special  weapons. 

Simple  paralleling  of  components  is  not  the  only  type  of  redundancy, 
and  I  shall  now  describe  several  other,  less  familiar,  types. 

1st  Example 

In  a  certain  subsystem  successful  operation  required  simultaneous  proper 
functioning  of  some  thirty-five  identical  components. 
With a mean component reliability of 0.99, the subsystem reliability
was 0.70, unacceptably low.

The solution of the difficulty was found in a subsystem redesign such
that successful operation required only that any thirty-four components work,
tolerating the failure of any one (but not more than one) of them.

This simple expedient raised the reliability to 0.95.
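
The improvement can be verified as a k-out-of-n calculation. The sketch below
assumes thirty-five independent components of equal reliability 0.99 and sums
the binomial probabilities of at least k of them working; the function name
`k_out_of_n` is illustrative.

```python
from math import comb


def k_out_of_n(k: int, n: int, r: float) -> float:
    """Probability that at least k of n independent components of reliability r work."""
    return sum(comb(n, i) * r**i * (1 - r) ** (n - i) for i in range(k, n + 1))


all_35 = k_out_of_n(35, 35, 0.99)   # original design: all must work
any_34 = k_out_of_n(34, 35, 0.99)   # redesign: one failure tolerated

print(f"all 35 required: {all_35:.2f}")
print(f"any 34 of 35:    {any_34:.2f}")
```

Tolerating a single failure adds the term 35 x 0.01 x 0.99^34, about 0.25, to
the original 0.70.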

2nd  Example 

Many types of components are characterized by being, at any given time,
in either one of two mutually exclusive states:

open  or  closed 
off  or  on 

non-conducting  or  conducting 
0  or  1 

Proper  operation  of  such  devices  consists  in  the  transition  from  the 
first  state  to  the  second  at  a  specified  time. 

Switches or valves are simple examples of such devices.
Let us use a single throw switch as an illustration.

There are clearly two modes of failure possible:

a) the premature failure, when the switch closes before it should, and

b) the dud failure, when it fails to close at the specified time.

Let the probabilities of these two modes of failure be denoted by $f_c$
and $f_o$ respectively.

Suppose instead of a single switch we use two switches in series:

_ / _ / _ 

Now the probability of premature failure will be given by

$$F_c = f_c^2$$

and the probability of dud failure by

$$F_o = 1 - (1 - f_o)^2 = 2f_o - f_o^2 .$$

For a concrete example, let $f_c = f_o = 0.001$. Then

$$F_c = 0.000001$$

and

$$F_o \approx 0.002,$$

so that the series arrangement, while very greatly decreasing the probability
of premature failure, doubles the probability of dud failure.

When the two switches are arranged in parallel the situation is reversed.
Now

$$F_c = 2f_c - f_c^2 ,$$

$$F_o = f_o^2 .$$


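This suggests a series-parallel arrangement of four switches, two
branches in parallel with two switches in series in each branch:

        +----/ ----/ ----+
   -----+                +-----
        +----/ ----/ ----+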

For this network the premature and the dud failure probabilities are
given by:

$$F_c = 1 - (1 - f_c^2)^2 = 2f_c^2 - f_c^4 ,$$

$$F_o = [1 - (1 - f_o)^2]^2 = 4f_o^2 - 4f_o^3 + f_o^4 .$$




Again let us assume $f_c = f_o = 0.001$.

Then, for a single switch the total probability of failure is

$$f = f_c + f_o = 0.002 ,$$

while for the series-parallel arrangement of four switches

$$F = F_c + F_o \approx 0.000006 ,$$

a tremendous improvement.
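
The comparison of the single switch, the two pairs, and the four-switch
network can be reproduced with a small sketch. It assumes independent
switches, treats premature and dud failures as mutually exclusive so that
their probabilities add, and takes the network to be two series pairs
connected in parallel; the function names are illustrative.

```python
def series_pair(f_c, f_o):
    """Two elements in series: both must close early; either dud fails the pair."""
    return f_c**2, 1 - (1 - f_o) ** 2


def parallel_pair(f_c, f_o):
    """Two elements in parallel: either early closure fails the pair; both must be duds."""
    return 1 - (1 - f_c) ** 2, f_o**2


f_c = f_o = 0.001

single = (f_c, f_o)
in_series = series_pair(f_c, f_o)
in_parallel = parallel_pair(f_c, f_o)
# Four switches: two series branches treated as the elements of a parallel pair.
network = parallel_pair(*series_pair(f_c, f_o))

for name, (premature, dud) in [
    ("single switch", single),
    ("two in series", in_series),
    ("two in parallel", in_parallel),
    ("series-parallel (4)", network),
]:
    print(f"{name:20s} premature={premature:.3e} dud={dud:.3e} "
          f"total={premature + dud:.3e}")
```

Because the pair functions return probabilities of the same two kinds they
accept, larger networks can be evaluated by nesting them in the same way.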

Such reliability networks have been comprehensively studied at Sandia,
but  unfortunately  this  work  is  virtually  unknown,  and  very  little  use  is 
being  made  of  these  ideas  in  actual  design. 

3rd Example

Perhaps  the  most  striking  example  of  the  use  of  redundant  design  may 
be  illustrated  by  the  following  hypothetical  case. 

Assume in a transport airplane a communication system composed of two
identical VHF sets, two identical UHF sets, and two identical LF sets.

The operational conditions are such that only one communication channel
is to be used at any one time.

Each of the six sets is a complete, self-contained system. The equipment
inside each box may be theoretically divided into three parts: the power
package, the amplifier section, and the oscillator section.

Suppose  the  sets  were  redesigned  so  that  the  power  supplies,  amplifiers, 
and  oscillators  would  be  built  as  modules. 

With  proper  switching  there  would  be  now  eight  different  ways  of  operating 
on  each  of  the  three  frequency  bands,  instead  of  only  two. 


Considering  the  extremely  high  state  of  the  switching  art,  the  addition 
of  switching  should  have  a  negligible  effect  on  the  over-all  system  reliability. 
Now, let us take the next logical step.

It is certainly an easy matter to design a general purpose power supply
capable of operating any of the sets.

Although more of a problem, the design of general purpose amplifiers is
also technically feasible with a sufficient R and D effort.

We now could replace the old communication system by one composed of two
general purpose power sources, two general purpose amplifier sections, six
oscillator sections, and proper switching.

Let  us  now  take  inventory. 

In the present system we have an equivalent of eighteen pieces of equipment,
nine different equipment types, and only six ways of getting through.

In  the  proposed  system  there  are  only  ten  pieces  of  equipment,  only  five 
different  equipment  types,  and  twenty-four  possible  channels  of  communication. 
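
The inventory can be confirmed by enumeration. The sketch below assumes that
any power source, any amplifier, and either oscillator of the chosen band can
be switched together to form a channel; the module labels are hypothetical.

```python
from itertools import product

BANDS = ("VHF", "UHF", "LF")
PARTS_PER_SET = 3  # power package, amplifier section, oscillator section

# Present system: two self-contained sets per band; the three parts of a
# set can only be used together.
present_pieces = len(BANDS) * 2 * PARTS_PER_SET
present_types = len(BANDS) * PARTS_PER_SET
present_paths = len(BANDS) * 2

# Proposed system: shared general purpose power sources and amplifiers,
# plus two band-specific oscillators per band (labels are hypothetical).
power = ["P1", "P2"]
amps = ["A1", "A2"]
oscillators = {band: [f"{band}-osc1", f"{band}-osc2"] for band in BANDS}

proposed_pieces = len(power) + len(amps) + sum(len(v) for v in oscillators.values())
proposed_types = 1 + 1 + len(BANDS)
paths = [
    combo
    for band in BANDS
    for combo in product(power, amps, oscillators[band])
]

print("present: ", present_pieces, "pieces,", present_types, "types,",
      present_paths, "channels")
print("proposed:", proposed_pieces, "pieces,", proposed_types, "types,",
      len(paths), "channels")
```

Each band gains 2 x 2 x 2 = 8 possible channels, against 2 in the present
system, while the number of pieces of equipment falls.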

A  comprehensive  use  of  the  above  design  philosophy  would  not  only 
increase  reliability,  but  would  also  decrease  equipment  weight  and  volume, 
reduce  frequency  of  failure,  decrease  maintenance,  decrease  spares  inventories, 
simplify  logistics,  reduce  procurement  requirements,  and  generally  lead  to 
tremendous  savings. 

Successful implementation of a reliability program involves not only
technical problems, but similarly important organizational and management
problems.

Time will not permit me to go into a discussion of the latter.


